{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In River, the features of a sample are stored inside a dictionary, which in Python is called a `dict` and is a native data structure. In other words, we don't use any sophisticated data structure, such as a `numpy.ndarray` or a `pandas.DataFrame`.\n",
    "\n",
    "The main advantage of using plain `dict`s is that it removes the overhead that comes with using the aforementioned data structures. This is important in a streaming context because we want to be able to process many individual samples in rapid succession. Another advantage is that `dict`s allow us to give names to our features. Finally, `dict`s are not typed, and can therefore store heterogeneous data.\n",
    "\n",
    "Another advantage which we haven't mentioned is that `dict`s play nicely with Python's standard library. Indeed, Python contains many tools that allow manipulating `dict`s. For instance, the `csv.DictReader` can be used to read a CSV file and convert each row to a `dict`. In fact, the `stream.iter_csv` method from River is just a wrapper on top of `csv.DictReader` that adds a few bells and whistles.\n",
    "\n",
    "River provides some out-of-the-box datasets to get you started."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:49:54.824345Z",
     "iopub.status.busy": "2023-12-04T17:49:54.823858Z",
     "iopub.status.idle": "2023-12-04T17:49:55.245273Z",
     "shell.execute_reply": "2023-12-04T17:49:55.244942Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Bike sharing station information from the city of Toulouse.\n",
       "\n",
       "The goal is to predict the number of bikes in 5 different bike stations from the city of\n",
       "Toulouse.\n",
       "\n",
       "      Name  Bikes                                                         \n",
       "      Task  Regression                                                    \n",
       "   Samples  182,470                                                       \n",
       "  Features  8                                                             \n",
       "    Sparse  False                                                         \n",
       "      Path  /Users/max/river_data/Bikes/toulouse_bikes.csv                \n",
       "       URL  https://maxhalford.github.io/files/datasets/toulouse_bikes.zip\n",
       "      Size  12.52 MB                                                      \n",
       "Downloaded  True                                                          "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import datasets\n",
    "\n",
    "dataset = datasets.Bikes()\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that when we say \"loaded\", we don't mean that the actual data is read from the disk. On the contrary, the dataset is a streaming data that can be iterated over one sample at a time. In Python lingo, it's a [generator](https://realpython.com/introduction-to-python-generators/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the first sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:49:55.246985Z",
     "iopub.status.busy": "2023-12-04T17:49:55.246871Z",
     "iopub.status.idle": "2023-12-04T17:49:55.257529Z",
     "shell.execute_reply": "2023-12-04T17:49:55.257231Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'moment': datetime.datetime(2016, 4, 1, 0, 0, 7),\n",
       " 'station': 'metro-canal-du-midi',\n",
       " 'clouds': 75,\n",
       " 'description': 'light rain',\n",
       " 'humidity': 81,\n",
       " 'pressure': 1017.0,\n",
       " 'temperature': 6.54,\n",
       " 'wind': 9.3}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x, y = next(iter(dataset))\n",
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each dataset is iterable, which means we can also do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:49:55.258972Z",
     "iopub.status.busy": "2023-12-04T17:49:55.258882Z",
     "iopub.status.idle": "2023-12-04T17:49:55.267622Z",
     "shell.execute_reply": "2023-12-04T17:49:55.267371Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'moment': datetime.datetime(2016, 4, 1, 0, 0, 7),\n",
       " 'station': 'metro-canal-du-midi',\n",
       " 'clouds': 75,\n",
       " 'description': 'light rain',\n",
       " 'humidity': 81,\n",
       " 'pressure': 1017.0,\n",
       " 'temperature': 6.54,\n",
       " 'wind': 9.3}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for x, y in dataset:\n",
    "    break\n",
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the values have different types.\n",
    "\n",
    "Under the hood, calling `for x, y in dataset` simply iterates over a file and parses each value appropriately. We can do this ourselves by using `stream.iter_csv`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:49:55.269019Z",
     "iopub.status.busy": "2023-12-04T17:49:55.268947Z",
     "iopub.status.idle": "2023-12-04T17:49:55.277375Z",
     "shell.execute_reply": "2023-12-04T17:49:55.277124Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'moment': '2016-04-01 00:00:07',\n",
       "  'bikes': '1',\n",
       "  'station': 'metro-canal-du-midi',\n",
       "  'clouds': '75',\n",
       "  'description': 'light rain',\n",
       "  'humidity': '81',\n",
       "  'pressure': '1017.0',\n",
       "  'temperature': '6.54',\n",
       "  'wind': '9.3'},\n",
       " None)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import stream\n",
    "\n",
    "X_y = stream.iter_csv(dataset.path)\n",
    "x, y = next(X_y)\n",
    "x, y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a couple things that are wrong. First of all, the numeric features have not been casted into numbers. Indeed, by default, `stream.iter_csv` assumes that everything is a string. A related issue is that the `moment` field hasn't been parsed into a `datetime`. Finally, the target field, which is `bikes`, hasn't been separated from the rest of the features. We can remedy to these issues by setting a few parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:49:55.278741Z",
     "iopub.status.busy": "2023-12-04T17:49:55.278667Z",
     "iopub.status.idle": "2023-12-04T17:49:55.287392Z",
     "shell.execute_reply": "2023-12-04T17:49:55.287176Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'moment': datetime.datetime(2016, 4, 1, 0, 0, 7),\n",
       "  'station': 'metro-canal-du-midi',\n",
       "  'clouds': 75,\n",
       "  'description': 'light rain',\n",
       "  'humidity': 81,\n",
       "  'pressure': 1017.0,\n",
       "  'temperature': 6.54,\n",
       "  'wind': 9.3},\n",
       " 1)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_y = stream.iter_csv(\n",
    "    dataset.path,\n",
    "    converters={\n",
    "        'bikes': int,\n",
    "        'clouds': int,\n",
    "        'humidity': int,\n",
    "        'pressure': float,\n",
    "        'temperature': float,\n",
    "        'wind': float\n",
    "    },\n",
    "    parse_dates={'moment': '%Y-%m-%d %H:%M:%S'},\n",
    "    target='bikes'\n",
    ")\n",
    "x, y = next(X_y)\n",
    "x, y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's much better. We invite you to take a look at the `stream` module to see for yourself what other methods are available. Note that River is first and foremost a machine learning library, and therefore isn't as much concerned about reading data as it is about statistical algorithms. We do however believe that the fact that we use dictionary gives you, the user, a lot of freedom and flexibility.\n",
    "\n",
    "The `stream` module provides helper functions to read data from different formats. For instance, you can use the `stream.iter_sklearn_dataset` function to turn any scikit-learn dataset into a stream."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:49:55.288856Z",
     "iopub.status.busy": "2023-12-04T17:49:55.288786Z",
     "iopub.status.idle": "2023-12-04T17:49:55.328888Z",
     "shell.execute_reply": "2023-12-04T17:49:55.328647Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'age': 0.038075906433423026,\n",
       "  'sex': 0.05068011873981862,\n",
       "  'bmi': 0.061696206518683294,\n",
       "  'bp': 0.0218723855140367,\n",
       "  's1': -0.04422349842444599,\n",
       "  's2': -0.03482076283769895,\n",
       "  's3': -0.04340084565202491,\n",
       "  's4': -0.002592261998183278,\n",
       "  's5': 0.019907486170462722,\n",
       "  's6': -0.01764612515980379},\n",
       " 151.0)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import datasets\n",
    "\n",
    "dataset = datasets.load_diabetes()\n",
    "\n",
    "for x, y in stream.iter_sklearn_dataset(dataset):\n",
    "    break\n",
    "\n",
    "x, y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To conclude, let us shortly mention the difference between *proactive learning* and *reactive learning* in the specific context of online machine learning. When we loop over a data with a `for` loop, we have the control over the data and the order in which it arrives. We are proactive in the sense that we, the user, are asking for the data to arrive.\n",
    "\n",
    "In contract, in a reactive situation, we don't have control on the data arrival. A typical example of such a situation is a web server, where web requests arrive in an arbitrary order. This is a situation where River shines. For instance, in a [Flask](https://flask.palletsprojects.com/en/1.1.x/) application, you could define a route to make predictions with a River model as so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:49:55.330316Z",
     "iopub.status.busy": "2023-12-04T17:49:55.330224Z",
     "iopub.status.idle": "2023-12-04T17:49:55.392538Z",
     "shell.execute_reply": "2023-12-04T17:49:55.392188Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import flask\n",
    "\n",
    "app = flask.Flask(__name__)\n",
    "\n",
    "@app.route('/', methods=['GET'])\n",
    "def predict():\n",
    "    payload = flask.request.json\n",
    "    river_model = load_model()\n",
    "    return river_model.predict_proba_one(payload)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Likewise, a model can be updated whenever a request arrives as so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:49:55.394374Z",
     "iopub.status.busy": "2023-12-04T17:49:55.394235Z",
     "iopub.status.idle": "2023-12-04T17:49:55.404823Z",
     "shell.execute_reply": "2023-12-04T17:49:55.404450Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "@app.route('/', methods=['POST'])\n",
    "def learn():\n",
    "    payload = flask.request.json\n",
    "    river_model = load_model()\n",
    "    river_model.learn_one(payload['features'], payload['target'])\n",
    "    return {}, 201"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To summarize, River can be used in many different ways. The fact that it uses dictionaries to represent features provides a lot of flexibility and space for creativity."
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "cd50771599b57adf6ea700641441f38e32b572b636c349c3951bac686eaf769f"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
